MATH 70076 - Data Science
{ggplot2}Tidy Tuesday, a weekly community activity challenge.
Explore, visualise, model or summarise a new data set each week.
Use the 2024-01-23 dataset.
The dataset comes from The UK Office for National Statistics.
It was explored in the July 2023 article “Why do children and young people in smaller towns do better academically than those in larger towns?”.
Reproduce this plot from the article.
2024-01-23_tidy-tuesday.R
# Tidy Tuesday 2024-01-23 English Education
# Load required packages -------------------------------------------------------
# Load data --------------------------------------------------------------------
# Data wrangling ---------------------------------------------------------------
# Create plot ------------------------------------------------------------------
# Save plot --------------------------------------------------------------------
# End of fileBefore we get started coding it is a good idea to take some time with a pen and paper to work out exactly what data we need to construct this plot and what form we need it in.
This stage would usually come after some exploration of the data and a rough hand-sketch of the plot we want to make. Our job is a bit simpler in this case, because we are reconstructing an existing plot. This means we don’t have to make as many decisions of our own, instead we just have to pay careful attention to the choices made in the original plot.
03:00
town_scoreslabel_df and arrow_df
town_scoresTake a look at the structure of the data we have loaded and try to construct town_scores.
[1] "town11cd"
[2] "town11nm"
[3] "population_2011"
[4] "size_flag"
[5] "rgn11nm"
[6] "coastal"
[7] "coastal_detailed"
[8] "ttwa11cd"
[9] "ttwa11nm"
[10] "ttwa_classification"
[11] "job_density_flag"
[12] "income_flag"
[13] "university_flag"
[14] "level4qual_residents35_64_2011"
[15] "ks4_2012_2013_counts"
[16] "key_stage_2_attainment_school_year_2007_to_2008"
[17] "key_stage_4_attainment_school_year_2012_to_2013"
[18] "level_2_at_age_18"
[19] "level_3_at_age_18"
[20] "activity_at_age_19_full_time_higher_education"
[21] "activity_at_age_19_sustained_further_education"
[22] "activity_at_age_19_appprenticeships"
[23] "activity_at_age_19_employment_with_earnings_above_0"
[24] "activity_at_age_19_employment_with_earnings_above_10_000"
[25] "activity_at_age_19_out_of_work"
[26] "highest_level_qualification_achieved_by_age_22_less_than_level_1"
[27] "highest_level_qualification_achieved_by_age_22_level_1_to_level_2"
[28] "highest_level_qualification_achieved_by_age_22_level_3_to_level_5"
[29] "highest_level_qualification_achieved_by_age_22_level_6_or_above"
[30] "highest_level_qualification_achieved_b_age_22_average_score"
[31] "education_score"
Referring to the data dictionary, it seems that the relevant columns are likely to be:
rgn11nm (character) the region name as recorded in 2011;education_score (numeric) the town/city education score based on attainment levels of the 2012/13 Key stage 4 cohort;income_flag (character) a variable used to describe towns as lower income deprivation, mid income deprivation or higher income deprivation.Our first step is to consider only the rows which correspond to towns. We can do this using the income_flag.
We filter to only rows there the income_flag contains the string "town".
# A tibble: 1,082 × 31
town11cd town11nm population_2011 size_flag rgn11nm coastal coastal_detailed
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 E34000007 Carlton… 5456 Small To… East M… Non-co… Smaller non-coa…
2 E34000016 Dorches… 19060 Small To… South … Non-co… Smaller non-coa…
3 E34000020 Ely BUA 19090 Small To… East o… Non-co… Smaller non-coa…
4 E34000026 Market … 6429 Small To… Yorksh… Non-co… Smaller non-coa…
5 E34000027 Downham… 10884 Small To… East o… Non-co… Smaller non-coa…
6 E34000039 Penrith… 15181 Small To… North … Non-co… Smaller non-coa…
7 E34000048 Bolsove… 11754 Small To… East M… Non-co… Smaller non-coa…
8 E34000055 March B… 21051 Medium T… East o… Non-co… Large non-coast…
9 E34000056 Southam… 6567 Small To… West M… Non-co… Smaller non-coa…
10 E34000067 Royston… 15781 Small To… East o… Non-co… Smaller non-coa…
# ℹ 1,072 more rows
# ℹ 24 more variables: ttwa11cd <chr>, ttwa11nm <chr>,
# ttwa_classification <chr>, job_density_flag <chr>, income_flag <chr>,
# university_flag <chr>, level4qual_residents35_64_2011 <chr>,
# ks4_2012_2013_counts <dbl>,
# key_stage_2_attainment_school_year_2007_to_2008 <dbl>,
# key_stage_4_attainment_school_year_2012_to_2013 <dbl>, …
We next provide more meaningful names, format the columns we will keep and drop all irrelevant columns.
Console
# A tibble: 1,082 × 3
region income education_score
<fct> <fct> <dbl>
1 East Midlands Higher deprivation towns -0.534
2 South West Mid deprivation towns 1.95
3 East of England Lower deprivation towns -1.04
4 Yorkshire and The Humber Lower deprivation towns -1.25
5 East of England Higher deprivation towns -1.17
6 North West Lower deprivation towns 0.845
7 East Midlands Higher deprivation towns -3.72
8 East of England Higher deprivation towns -2.17
9 West Midlands Lower deprivation towns 6.44
10 East of England Lower deprivation towns 0.295
# ℹ 1,072 more rows
To calculate our average education score in each region-income combination, we have to group our observations by both of these factors.
We can then calculate a the required summary statistic for each group.
# A tibble: 24 × 3
# Groups: region [8]
region income mean_education_score
<fct> <fct> <dbl>
1 East Midlands Higher deprivation towns -2.96
2 East Midlands Lower deprivation towns 2.23
3 East Midlands Mid deprivation towns -1.18
4 East of England Higher deprivation towns -3.35
5 East of England Lower deprivation towns 2.60
6 East of England Mid deprivation towns -1.92
7 North East Higher deprivation towns -2.14
8 North East Lower deprivation towns 4.63
9 North East Mid deprivation towns 1.42
10 North West Higher deprivation towns -1.30
# ℹ 14 more rows
summarise()has grouped output by ‘region’. You can override using the.groupsargument.
This warning reminds us that this summarise() call returns a tibble that is still grouped by region. To avoid any mistakes later, let’s undo this grouping.
town_scores
2024-01-23_tidy-tuesday.R # Data Wrangling ----
label_dfThe first column in labels_df should be the mean education scores for the North West region. We can easily extract these from town_scores, which we just created.
# A tibble: 3 × 1
mean_education_score
<dbl>
1 -1.30
2 5.00
3 1.72
The issue here is that we still have a tibble. If we want to extract a column as a numeric vector then we can use pull instead of select.
NW_scoresConsole
[1] -1.302181 5.004601 1.721192
arrow_dfWe can similarly construct arrow_df.
2024-01-23_tidy-tuesday.R # Data Wrangling ----
town_scores <- education %>%
filter(str_detect(income_flag, pattern = "town")) %>%
mutate(region = as.factor(rgn11nm), income = as.factor(income_flag)) %>%
select(region, income, education_score) %>%
group_by(region, income) %>%
summarise(mean_education_score = mean(education_score)) %>%
ungroup()
NW_scores <- town_scores %>%
filter(region == "North West") %>%
pull(mean_education_score)
label_df <- tibble(
horizontal = NW_scores,
vertical = rep(9, 3),
string = c("Higher income deprivation", "Lower", "Mid")
)
arrow_df <- tibble(
horizontal = NW_scores,
upper = rep(8.75, 3),
lower = rep(8.25, 3)
)The final step in our data preparations is to create some variables that we can use later to access the colours and texts strings that we will use when creating the plot.
2024-01-23_tidy-tuesday.R # Data Wrangling ----
# Note added line breaks (\n) to match original formatting
strings <- list(
title = "Towns in the North West have the highest attainment scores at
all \n income deprivation levels",
sub = "\n \n Average educational attainment score for towns, by region
and income \n deprivation level, England",
cap = "Source: Office for National Statistics analysis using Longitudinal
Education Outcomes (LEO) \n from the Department for Education (DfE) and Index
of Multiple Deprivation 2019 from the \n Department for Levelling Up, Housing
and Communities (DLUHC)"
)
ONS_palette <- list(
red = "#7c265a",
blue = "#315f91",
light_grey = "#d3d3d3",
dark_grey = "#8A8A8A",
black = "#222222",
white = "#ffffff"){ggplot2}We will use {ggplot2} to create our visualisation, so we have to add this to our library calls.
Using geom_point() create a scatter plot. We want to use town_scores as the source of our data, mapping: the mean_score column to the x location of each point, the region column to the y location of each point, the income column to determine the colour of each point.
Switch to solid circles, plotting character 21.
fill for hollow shapesWe can add another layer to our plot, to thicken the vertical line \(x=0\).
theme_minimal()We can use theme() to correct other cosmetic problems, such as removing the minor vertical grid lines and making the horizontal grid lines dashed.
coord_cartesian()This has fixed one problem but created another. Now our grid lines extend up too far. Let’s replace them with line segments using geom_linerange().
Remove the gridlines using theme() and define lines we want to draw.
Add those lines that we just defined. Beautiful!
p <- ggplot() +
geom_point(
data = town_scores,
mapping = aes(
x = mean_education_score,
y = region,
fill = income),
shape = 21,
) +
geom_linerange(
mapping = aes(x = 0, ymin = 0.25, ymax = 8.25),
colour = ONS_palette$dark_grey,
linewidth = 0.8) +
geom_linerange(
data = vertical_grid_df,
mapping = aes(x = x, ymin = ymin, ymax = ymax),
colour = ONS_palette$light_grey,
linewidth = 0.3) +
coord_cartesian(xlim = c(-4, 6), ylim = c(1.25,8.75))
p +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linetype = 2)
)Now let’s add text to the plot. That looks disgusting.
p <- p +
labs(title = strings$title,
subtitle = strings$sub,
caption = strings$cap)
p +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linetype = 2)
){ggtext}theme_ONS <- theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linetype = 2),
plot.title = element_markdown(
size = 15,
hjust = 0,
face = "bold",
margin = margin(0,0,15,0)),
plot.title.position = "plot",
plot.subtitle = element_markdown(
size = 14,
hjust = 0,
margin = margin(0,0,15,0)),
plot.caption = element_markdown(
size = 12,
hjust = 0,
color = ONS_palette$dark_grey),
plot.caption.position = "plot")
p +
theme_minimal() +
theme_ONSTo get the locations arranged with Northern places at the top and Southern places at the bottom, we need to re-level the location factor in town_scores.
library(forcats)
town_scores <- town_scores %>%
mutate(region = fct_relevel(region,
"South West",
"East of England",
"South East",
"East Midlands",
"West Midlands",
"Yorkshire and The Humber",
"North East",
"North West"))
p <- ggplot() +
geom_point(
data = town_scores,
mapping = aes(
x = mean_education_score,
y = region,
fill = income),
shape = 21) +
geom_linerange(
mapping = aes(x = 0, ymin = 0.25, ymax = 8.25),
colour = ONS_palette$dark_grey,
linewidth = 0.8) +
geom_linerange(
data = vertical_grid_df,
mapping = aes(x = x, ymin = ymin, ymax = ymax),
colour = ONS_palette$light_grey,
linewidth = 0.3) +
coord_cartesian(xlim = c(-4, 6), ylim = c(1.25,8.75)) +
labs(title = strings$title, subtitle = strings$sub, caption = strings$cap)
p + theme_minimal() + theme_ONSAdd a line break into the third level of the location factor.
town_scores <- town_scores %>%
mutate(region = fct_recode(region,
"Yorkshire and\n The Humber" = "Yorkshire and The Humber"))
p <- ggplot() +
geom_point(
data = town_scores,
mapping = aes(
x = mean_education_score,
y = region,
fill = income),
shape = 21,
) +
geom_linerange(
mapping = aes(x = 0, ymin = 0.25, ymax = 8.25),
colour = ONS_palette$dark_grey,
linewidth = 0.8) +
geom_linerange(
data = vertical_grid_df,
mapping = aes(x = x, ymin = ymin, ymax = ymax),
colour = ONS_palette$light_grey,
linewidth = 0.3) +
coord_cartesian(xlim = c(-4, 6), ylim = c(1.25,8.75)) +
labs(title = strings$title, subtitle = strings$sub, caption = strings$cap)
p + theme_minimal() + theme_ONSMake the plotting characters larger and switch to a custom ONS colour scheme.
p <- ggplot() +
geom_point(
data = town_scores,
mapping = aes(
x = mean_education_score,
y = region,
fill = income),
shape = 21,
size = 4) +
geom_linerange(
mapping = aes(x = 0, ymin = 0.25, ymax = 8.25),
colour = ONS_palette$dark_grey,
linewidth = 0.8) +
geom_linerange(
data = vertical_grid_df,
mapping = aes(x = x, ymin = ymin, ymax = ymax),
colour = ONS_palette$light_grey,
linewidth = 0.3) +
coord_cartesian(xlim = c(-4, 6), ylim = c(1.25,8.75)) +
labs(title = strings$title, subtitle = strings$sub, caption = strings$cap) +
scale_fill_manual(values = unname(ONS_palette))
p + theme_minimal() + theme_ONSWe can add a layer of text annotations to our plot using geom_text().
Similarly, geom_segment() can be adapted to draw arrows. To avoid these overlapping with our text, we also have to nudge our labels up a little bit.
p <- p +
geom_text(
data = label_df,
mapping = aes(
x = horizontal,
y = vertical,
label = string),
color = ONS_palette$black) +
geom_segment(
data = arrow_df,
mapping = aes(
x = horizontal,
xend = horizontal,
y = upper,
yend = lower),
arrow = arrow(length = unit(0.1, "inches"))
)
p + theme_minimal() + theme_ONS2024-01-23_tidy-tuesday.R
# Tidy Tuesday 2024-01-23 English Education
# Load required packages -------------------------------------------------------
library(readr) # loading data
library(dplyr) # data frame manipulation
library(stringr) # character string manipulation
library(ggplot2) # plotting
library(ggtext) # element_markdown for subtitle formatting
library(forcats) # working with factors
# Load data --------------------------------------------------------------------
url <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-23/english_education.csv'
education <- readr::read_csv(file = url)
# Data wrangling ---------------------------------------------------------------
## Main plot ---------------------------------
town_scores <- education %>%
filter(str_detect(income_flag, pattern = "town")) %>%
mutate(region = as.factor(rgn11nm), income = as.factor(income_flag)) %>%
select(region, income, education_score) %>%
group_by(region, income) %>%
summarise(mean_education_score = mean(education_score)) %>%
ungroup() %>%
mutate(region = fct_relevel(region,
"South West",
"East of England",
"South East",
"East Midlands",
"West Midlands",
"Yorkshire and The Humber",
"North East",
"North West"))%>%
mutate(region = fct_relevel(
region,
"South West",
"East of England",
"South East",
"East Midlands",
"West Midlands",
"Yorkshire and The Humber",
"North East",
"North West")) %>%
mutate(region = fct_recode(region,
"Yorkshire and\n The Humber" = "Yorkshire and The Humber"))
## Annotations ---------------------------------
NW_scores <- town_scores %>%
filter(region == "North West") %>%
pull(mean_education_score)
label_df <- tibble(
horizontal = NW_scores,
vertical = rep(9, 3),
string = c("Higher income deprivation", "Lower", "Mid")
)
arrow_df <- tibble(
horizontal = NW_scores,
upper = rep(8.75, 3),
lower = rep(8.25, 3)
)
strings <- list(
title = "Towns in the North West have the highest attainment scores at
all \n income deprivation levels",
sub = "\n \n Average educational attainment score for towns, by region
and income \n deprivation level, England",
cap = "Source: Office for National Statistics analysis using Longitudinal
Education Outcomes (LEO) \n from the Department for Education (DfE) and Index
of Multiple Deprivation 2019 from the \n Department for Levelling Up, Housing
and Communities (DLUHC)"
)
ONS_palette <- list(
red = "#7c265a",
blue = "#315f91",
light_grey = "#d3d3d3",
dark_grey = "#8A8A8A",
black = "#222222",
white = "#ffffff")
vertical_grid_df <- tibble(
x = seq(-4, 6, by = 2),
ymin = rep(0, length(x)),
ymax = rep(8.25, length(x))
)
# Create plot ------------------------------------------------------------------
p <- ggplot() +
geom_point(
data = town_scores,
mapping = aes(
x = mean_education_score,
y = region,
fill = income),
shape = 21,
size = 4) +
geom_linerange(
mapping = aes(x = 0, ymin = 0.25, ymax = 8.25),
colour = ONS_palette$dark_grey,
linewidth = 0.8) +
geom_linerange(
data = vertical_grid_df,
mapping = aes(x = x, ymin = ymin, ymax = ymax),
colour = ONS_palette$light_grey,
linewidth = 0.3) +
coord_cartesian(xlim = c(-4, 6), ylim = c(1.25,8.75)) +
labs(title = strings$title, subtitle = strings$sub, caption = strings$cap) +
scale_fill_manual(values = unname(ONS_palette)) +
geom_text(
data = label_df,
mapping = aes(
x = horizontal,
y = vertical,
label = string),
color = ONS_palette$black) +
geom_segment(
data = arrow_df,
mapping = aes(
x = horizontal,
xend = horizontal,
y = upper,
yend = lower),
arrow = arrow(length = unit(0.1, "inches"))) +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(size = 12, colour = ONS_palette$black),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linetype = 2),
plot.title = element_markdown(size = 15,
hjust = 0,
face = "bold",
margin = margin(0,0,15,0)),
plot.title.position = "plot",
plot.subtitle = element_markdown(size = 14,
hjust = 0,
margin = margin(0,0,15,0)),
plot.caption = element_markdown(size = 12,
hjust = 0,
color = ONS_palette$dark_grey),
plot.caption.position = "plot")
# Save plot --------------------------------------------------------------------
out_path <- "2024-01-23-english-education"
ggsave(
plot = p,
filename = paste0(out_path, ".png"),
device = "png",
width = 7.5,
height = 6.5,
units = "in",
dpi = 300,
bg = ONS_palette$white
)
# End of fileEffective Data Science: EDAV - ggplot2 demo - Zak Varty